This uses transformers, which IIUC is far less efficient for inference than e.g. vLLM, to an extent that is probably unacceptable for production use cases.
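For concreteness, here is a rough sketch of what the swap would look like (the model id is just a placeholder; the gap mostly shows up when serving many requests, since vLLM does continuous batching and paged attention):

```python
# --- transformers: simple, but slow for high-throughput inference ---
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
inputs = tok("Hello, world", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

# --- vLLM: same model, batched and production-oriented ---
from vllm import LLM, SamplingParams

llm = LLM(model=model_id)
params = SamplingParams(max_tokens=64)
print(llm.generate(["Hello, world"], params)[0].outputs[0].text)
```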
Logical counterfactuals are when you say something like "Suppose $\phi$ were true; what would that imply?"
They play an important role in logical decision theory.
Suppose you take a false proposition $\phi$ and then take a logical counterfactual in which $\phi$ is true. I am imagining this counterfactual as a function $C_\phi$ that sends counterfactually true statements to 1 and counterfactually false statements to 0.
Suppose $\phi$ is "$\neg$ Fermat's last theorem". In the counterfactual where Fermat's last theorem is false, I would still expect $2+2=4$. Perhaps not with measure 1, but close. So $C_\phi(2+2=4) \approx 1$.
On the other hand, I would expect trivial rephrasings of Fermat's last theorem to come out false in this counterfactual, or at least mostly false.
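Collecting these desiderata (a rough summary, not a definition), with $\phi = \neg\mathrm{FLT}$:

$$
C_\phi(\phi) = 1, \qquad C_\phi(2+2=4) \approx 1, \qquad C_\phi(\mathrm{FLT}') \approx 0
$$

for any trivial rephrasing $\mathrm{FLT}'$ of Fermat's last theorem.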
But does this counterfactual produce a specific counterexample? Does it think that $a^n + b^n = c^n$ for some particular $a, b, c$ and $n > 2$? Or does it do something where the counterfactual insists a counterexample exists, but...
There is a model of bounded rationality: logical induction.
Can that be used to handle logical counterfactuals?
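One naive candidate (just a sketch, and not something the logical induction framework itself proposes) is to read the counterfactual off a logical inductor's prices $\mathbb{P}_n$ by conditioning:

$$
C_\phi(\psi) \stackrel{?}{=} \frac{\mathbb{P}_n(\psi \wedge \phi)}{\mathbb{P}_n(\phi)}.
$$

The obvious problem is that for a disprovable $\phi$ (like $\neg$ Fermat's last theorem) the price $\mathbb{P}_n(\phi)$ goes to 0 as $n$ grows, so the conditional becomes unstable exactly in the limit where we wanted a well-behaved counterfactual. Whatever the right construction is, it probably is not plain conditioning.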
I hear a lot of different arguments floating around for exactly how mechanistic interpretability research will reduce x-risk. As an interpretability researcher, forming clearer thoughts on this is pretty important to me! As a preliminary step, I've compiled a longlist of 19 different arguments I've heard for why interpretability matters. These are pretty scattered and early-stage thoughts (and emphatically my personal opinion rather than the official opinion of Anthropic!), but I'm sharing them in the hopes that this is interesting to people.
(Note: I have not thought hard about this categorisation! Some of these overlap substantially, but feel subtly different in my head. I was not optimising for concision or for having few categories, and expect I could cut this down substantially with effort.)
Credit to Evan...
TL;DR: In September 2024, OpenAI released o1, its first "reasoning model". This model exhibits remarkable test-time scaling laws, which complete a missing piece of the Bitter Lesson and open up a new axis for scaling compute. Following Rush and Ritter (2024) and Brown (2024a, 2024b), I explore four hypotheses for how o1 works and discuss some implications for future scaling and recursive self-improvement.
The Bitter Lesson is that "general methods that leverage computation are ultimately the most effective, and by a large margin." After a decade of scaling pretraining, it's easy to forget this lesson is not just about learning; it's also about search.
OpenAI didn't forget. Their new "reasoning model" o1 has figured out how to scale search during inference time. This does not use explicit...
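To make "scaling compute at inference time" concrete, here is a toy sketch of the simplest such knob, best-of-N sampling against a scorer. It only illustrates the general idea of spending more inference compute for better answers; it is not a claim about o1's actual mechanism, which OpenAI has not disclosed, and the names (`toy_generate`, `toy_score`) are hypothetical stand-ins.

```python
import random
from typing import Callable

def best_of_n(generate: Callable[[str], str],
              score: Callable[[str, str], float],
              prompt: str,
              n: int) -> str:
    """Draw n candidate answers and keep the highest-scoring one.
    Larger n = more test-time compute = (usually) a better answer."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda ans: score(prompt, ans))

# Hypothetical stand-ins: any sampler and any verifier/reward model would do.
def toy_generate(prompt: str) -> str:
    return str(random.randint(0, 9))      # a noisy candidate answer

def toy_score(prompt: str, answer: str) -> float:
    return -abs(int(answer) - 7)          # pretend 7 is the correct answer

print(best_of_n(toy_generate, toy_score, "pick the right number", n=32))
```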
I could imagine other ways you might choose to train your untrusted monitor instead, which could benefit from debate:
- You train a general-purpose reasoner (via debate), and simply prompt it to do untrusted monitoring. This could do better than a trained untrusted monitor because it is less likely to overfit to spurious features of synthetically generated dangerous actions. (You instead use the synthetically generated dangerous actions to validate your untrusted monitor, or to hill climb on your prompt.)
- You do train on synthetically generated dangerous actions...